Adult

The dataset analyzed here is called ADULT and is available at https://archive.ics.uci.edu/dataset/2/adult.

It is widely used by data science students, mainly to predict whether a person's income exceeds $50K/year from American census data, which is why it is also known as the "Census Income Dataset".

The data were extracted from the 1994 census database. The dataset has 48842 rows and 14 attribute columns, plus the income label.

Obtaining and Loading the Data

Since the data are available online, they will be downloaded directly into the notebook.

In [1]:
%%bash

## Check whether adult.zip exists and download it from UCI if necessary:
[ -e adult.zip ] && echo zip found || wget https://archive.ics.uci.edu/static/public/2/adult.zip

## Check whether the archive has already been extracted and extract it if necessary:
[ -e adult ] && echo zip extracted || unzip adult.zip -d adult

ls -lah adult
zip found
zip extracted
total 5.8M
drwxr-xr-x 1 wesley wesley  102 Sep 24 00:46 .
drwxr-xr-x 1 wesley wesley  166 Nov 23 22:32 ..
-rwx------ 1 wesley wesley 3.8M May 22  2023 adult.data
-rwx------ 1 wesley wesley 5.2K May 22  2023 adult.names
-rwx------ 1 wesley wesley 2.0M May 22  2023 adult.test
-rwx------ 1 wesley wesley  140 May 22  2023 Index
-rwx------ 1 wesley wesley 4.2K May 22  2023 old.adult.names
In [2]:
## Let's get to know the files a little better
!head -n 15 adult/*
==> adult/adult.data <==
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K
28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, <=50K
37, Private, 284582, Masters, 14, Married-civ-spouse, Exec-managerial, Wife, White, Female, 0, 0, 40, United-States, <=50K
49, Private, 160187, 9th, 5, Married-spouse-absent, Other-service, Not-in-family, Black, Female, 0, 0, 16, Jamaica, <=50K
52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K
31, Private, 45781, Masters, 14, Never-married, Prof-specialty, Not-in-family, White, Female, 14084, 0, 50, United-States, >50K
42, Private, 159449, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 5178, 0, 40, United-States, >50K
37, Private, 280464, Some-college, 10, Married-civ-spouse, Exec-managerial, Husband, Black, Male, 0, 0, 80, United-States, >50K
30, State-gov, 141297, Bachelors, 13, Married-civ-spouse, Prof-specialty, Husband, Asian-Pac-Islander, Male, 0, 0, 40, India, >50K
23, Private, 122272, Bachelors, 13, Never-married, Adm-clerical, Own-child, White, Female, 0, 0, 30, United-States, <=50K
32, Private, 205019, Assoc-acdm, 12, Never-married, Sales, Not-in-family, Black, Male, 0, 0, 50, United-States, <=50K
40, Private, 121772, Assoc-voc, 11, Married-civ-spouse, Craft-repair, Husband, Asian-Pac-Islander, Male, 0, 0, 40, ?, >50K

==> adult/adult.names <==
| This data was extracted from the census bureau database found at
| http://www.census.gov/ftp/pub/DES/www/welcome.html
| Donor: Ronny Kohavi and Barry Becker,
|        Data Mining and Visualization
|        Silicon Graphics.
|        e-mail: ronnyk@sgi.com for questions.
| Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).
| 48842 instances, mix of continuous and discrete    (train=32561, test=16281)
| 45222 if instances with unknown values are removed (train=30162, test=15060)
| Duplicate or conflicting instances : 6
| Class probabilities for adult.all file
| Probability for the label '>50K'  : 23.93% / 24.78% (without unknowns)
| Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)
|
| Extraction was done by Barry Becker from the 1994 Census database.  A set of

==> adult/adult.test <==
|1x3 Cross validator
25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K.
38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, Husband, White, Male, 0, 0, 50, United-States, <=50K.
28, Local-gov, 336951, Assoc-acdm, 12, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K.
44, Private, 160323, Some-college, 10, Married-civ-spouse, Machine-op-inspct, Husband, Black, Male, 7688, 0, 40, United-States, >50K.
18, ?, 103497, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 30, United-States, <=50K.
34, Private, 198693, 10th, 6, Never-married, Other-service, Not-in-family, White, Male, 0, 0, 30, United-States, <=50K.
29, ?, 227026, HS-grad, 9, Never-married, ?, Unmarried, Black, Male, 0, 0, 40, United-States, <=50K.
63, Self-emp-not-inc, 104626, Prof-school, 15, Married-civ-spouse, Prof-specialty, Husband, White, Male, 3103, 0, 32, United-States, >50K.
24, Private, 369667, Some-college, 10, Never-married, Other-service, Unmarried, White, Female, 0, 0, 40, United-States, <=50K.
55, Private, 104996, 7th-8th, 4, Married-civ-spouse, Craft-repair, Husband, White, Male, 0, 0, 10, United-States, <=50K.
65, Private, 184454, HS-grad, 9, Married-civ-spouse, Machine-op-inspct, Husband, White, Male, 6418, 0, 40, United-States, >50K.
36, Federal-gov, 212465, Bachelors, 13, Married-civ-spouse, Adm-clerical, Husband, White, Male, 0, 0, 40, United-States, <=50K.
26, Private, 82091, HS-grad, 9, Never-married, Adm-clerical, Not-in-family, White, Female, 0, 0, 39, United-States, <=50K.
58, ?, 299831, HS-grad, 9, Married-civ-spouse, ?, Husband, White, Male, 0, 0, 35, United-States, <=50K.

==> adult/Index <==
Index of adult

02 Dec 1996      140 Index
10 Aug 1996  3974305 adult.data
10 Aug 1996     4267 adult.names
10 Aug 1996  2003153 adult.test

==> adult/old.adult.names <==
1. Title of Database: adult
2. Sources:
   (a) Original owners of database (name/phone/snail address/email address)
       US Census Bureau.
   (b) Donor of database (name/phone/snail address/email address)
       Ronny Kohavi and Barry Becker, 
       Data Mining and Visualization
       Silicon Graphics.
       e-mail: ronnyk@sgi.com
   (c) Date received (databases may change over time without name change!)
       05/19/96
3. Past Usage:
   (a) Complete reference of article where it was described/used
        @inproceedings{kohavi-nbtree,
           author={Ron Kohavi},

An initial look at the files shows that the data of interest are in adult.data and adult.test. The first line of the test file must be skipped, and its labels end with a period (for example, <=50K.). The column names are listed in adult.names:

  • age: continuous.
  • workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  • fnlwgt: continuous.
  • education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  • education-num: continuous.
  • marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  • occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  • relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  • race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
  • sex: Female, Male.
  • capital-gain: continuous.
  • capital-loss: continuous.
  • hours-per-week: continuous.
  • native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
  • above50k (target column, named here): >50K, <=50K.

We will use Pandas for the initial load.

The code below imports the Pandas library, defines a list with the column names, and loads the data file into a DataFrame object. Special care was taken so that Pandas recognizes missing values, which are marked with a question mark in the files.

In [3]:
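import pandas as pd

# Column names follow adult.names, with hyphens replaced by underscores; 'above50k' is the target.
features = 'age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country above50k'.split()
# The file has no header row. Fields are separated by ', ', so '?' may carry leading spaces.
df = pd.read_csv('adult/adult.data', names=features, na_values=['?',' ?','  ?'], keep_default_na=True)
df.sample(20)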
Out[3]:
age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country above50k
11118 36 Private 58602 5th-6th 3 Never-married Other-service Not-in-family White Male 0 0 35 United-States <=50K
27689 49 Private 200198 HS-grad 9 Married-civ-spouse Other-service Husband Black Male 0 0 40 United-States <=50K
14198 61 Private 273803 HS-grad 9 Married-civ-spouse Craft-repair Husband White Male 0 0 40 United-States <=50K
8563 28 Private 194690 7th-8th 4 Separated Other-service Own-child White Male 0 0 60 Mexico <=50K
775 23 Local-gov 282579 Assoc-voc 11 Divorced Tech-support Not-in-family White Male 0 0 56 United-States <=50K
31725 30 Private 147596 Some-college 10 Divorced Adm-clerical Unmarried Black Female 0 0 40 United-States <=50K
27644 56 Local-gov 273084 Masters 14 Married-civ-spouse Exec-managerial Husband Black Male 0 0 40 United-States >50K
28774 51 Private 123053 Masters 14 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 5013 0 40 India <=50K
13248 68 Private 168794 Preschool 1 Never-married Machine-op-inspct Not-in-family White Male 0 0 10 United-States <=50K
453 42 Private 197583 Assoc-acdm 12 Married-civ-spouse Exec-managerial Husband Black Male 0 0 40 NaN >50K
17451 61 Private 85194 Some-college 10 Married-civ-spouse Tech-support Husband White Male 0 0 25 United-States <=50K
17223 20 Private 54012 HS-grad 9 Never-married Handlers-cleaners Own-child White Male 0 0 40 United-States <=50K
30502 60 Private 198727 HS-grad 9 Married-civ-spouse Prof-specialty Husband White Male 0 0 30 United-States <=50K
28657 46 Private 465974 11th 7 Never-married Transport-moving Own-child White Male 0 0 30 United-States <=50K
20179 54 Private 97778 7th-8th 4 Married-civ-spouse Craft-repair Husband White Male 0 0 40 United-States <=50K
11659 59 Private 147989 HS-grad 9 Married-civ-spouse Machine-op-inspct Husband White Male 0 0 40 NaN <=50K
23774 59 Private 98361 11th 7 Married-civ-spouse Machine-op-inspct Husband White Male 0 0 40 United-States <=50K
662 27 Private 111900 Some-college 10 Never-married Prof-specialty Not-in-family White Male 0 0 40 United-States <=50K
28496 31 Self-emp-not-inc 265807 Assoc-voc 11 Married-civ-spouse Exec-managerial Husband White Male 3137 0 50 United-States <=50K
18979 44 Private 86298 HS-grad 9 Divorced Craft-repair Not-in-family White Male 0 0 40 United-States <=50K
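
The test file mentioned earlier can be read in much the same way. The following is a minimal sketch, not the loading code used in this notebook: `load_adult` and `COLUMNS` are hypothetical names, and the input is an in-memory copy of three lines from the head of adult.test shown above rather than the real file. It assumes `skipinitialspace=True` is acceptable, which removes the space after each comma so that a bare `'?'` is enough to mark missing values.

```python
import io
import pandas as pd

COLUMNS = ('age workclass fnlwgt education education_num marital_status '
           'occupation relationship race sex capital_gain capital_loss '
           'hours_per_week native_country above50k').split()

def load_adult(source, skiprows=0):
    """Hypothetical helper: read an Adult file and normalize whitespace, NAs and labels."""
    frame = pd.read_csv(
        source,
        names=COLUMNS,
        skiprows=skiprows,       # adult.test starts with one non-data line
        skipinitialspace=True,   # drop the space that follows each comma
        na_values='?',           # with the space gone, a bare '?' is enough
    )
    # Labels in adult.test end with a period ('<=50K.'); remove it so both files agree.
    frame['above50k'] = frame['above50k'].str.rstrip('.')
    return frame

# Lines copied from the head of adult.test, used here instead of the real file.
sample_test = io.StringIO(
    "|1x3 Cross validator\n"
    "25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K.\n"
    "18, ?, 103497, Some-college, 10, Never-married, ?, Own-child, White, Female, 0, 0, 30, United-States, <=50K.\n"
)
test_df = load_adult(sample_test, skiprows=1)
print(test_df[['workclass', 'sex', 'above50k']])
```

With the real file the call would be `load_adult('adult/adult.test', skiprows=1)`. Values loaded this way have no leading space, so any later filter written against `" Male"` would need to compare with `"Male"` instead.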

There are numeric and categorical columns. Each type of data needs to be analyzed differently.

Let's create functions to process the different variable types and show descriptive statistics. What we want to know about each type of variable:

Categorical

  • How many rows are there?
  • How many rows are missing?
  • Proportion of valid rows to NA (chart)
  • How many categories are there?
  • What are the categories?
  • What is the mode?
  • How many records are there per category? (bar chart)

Numeric

  • How many rows are there?
  • How many rows are missing?
  • Proportion of valid rows to NA (chart)
  • Minimum, maximum, mean, median, standard deviation
  • Histogram
  • Boxplot
In [4]:
import plotly.express as px
import plotly.graph_objects as go

def analisa_var_categorica(nome, dataframe):
    """Print a summary of a categorical column and plot its NA share and category counts."""
    v = dataframe[nome]
    na = v.isna().sum()
    print('\n\n')
    print('=' * 100, '\n')
    print(f'The variable {nome} has {len(v)} rows.')
    print(f'Of these, {na} rows contain null values (NA/NaN/etc).')
    fig1 = px.pie(values=[len(v)-na, na], names=['Valid', 'Null'], hole=.6, width=800)
    fig1.show()
    # Note: unique() counts NaN as a distinct value when the column has missing data.
    print(f'In total, the series has {len(v.unique())} distinct categories. They are:')
    fig2 = px.bar(v.value_counts(), width=800)
    fig2.show()
    print(f'The mode of this variable is the value {v.mode()[0]}.')

def analisa_var_numerica(nome, dataframe):
    """Print descriptive statistics of a numeric column and plot its NA share, histogram and boxplot."""
    v = dataframe[nome]
    na = v.isna().sum()
    print('\n\n')
    print('=' * 100, '\n')
    print(f'The variable {nome} has {len(v)} rows.')
    print(f'Of these, {na} rows contain null values (NA/NaN/etc).')
    fig1 = px.pie(values=[len(v)-na, na], names=['Valid', 'Null'], hole=.6, width=800)
    fig1.show()
    print(f'Minimum: {v.min()}')
    print(f'Maximum: {v.max()}')
    print(f'Mean: {v.mean()}')
    print(f'Median: {v.median()}')
    print(f'Standard deviation: {v.std()}')
    fig2 = px.histogram(v, width=800)
    fig2.show()
    fig3 = px.box(v, width=800)
    fig3.show()

def analisa_dataframe(df):
    """Dispatch each column to the numeric or categorical analysis based on its dtype."""
    for c in df.columns:
        # dtype.kind is 'f' for floats and 'i' for integers; anything else is treated as categorical.
        if df[c].dtype.kind in 'fi':
            analisa_var_numerica(c, df)
        else:
            analisa_var_categorica(c, df)
In [5]:
analisa_dataframe(df)


==================================================================================================== 

The variable age has 32561 rows.
Of these, 0 rows contain null values (NA/NaN/etc).
Minimum: 17
Maximum: 90
Mean: 38.58164675532078
Median: 37.0
Standard deviation: 13.640432553581341


==================================================================================================== 

The variable workclass has 32561 rows.
Of these, 1836 rows contain null values (NA/NaN/etc).
In total, the series has 9 distinct categories. They are:
The mode of this variable is the value  Private.



==================================================================================================== 

The variable fnlwgt has 32561 rows.
Of these, 0 rows contain null values (NA/NaN/etc).
Minimum: 12285
Maximum: 1484705
Mean: 189778.36651208502
Median: 178356.0
Standard deviation: 105549.97769702224


==================================================================================================== 

The variable education has 32561 rows.
Of these, 0 rows contain null values (NA/NaN/etc).
In total, the series has 16 distinct categories. They are:
The mode of this variable is the value  HS-grad.



==================================================================================================== 

The variable education_num has 32561 rows.
Of these, 0 rows contain null values (NA/NaN/etc).
Minimum: 1
Maximum: 16
Mean: 10.0806793403151
Median: 10.0
Standard deviation: 2.5727203320673877


==================================================================================================== 

The variable marital_status has 32561 rows.
Of these, 0 rows contain null values (NA/NaN/etc).
In total, the series has 7 distinct categories. They are:
The mode of this variable is the value  Married-civ-spouse.



==================================================================================================== 

The variable occupation has 32561 rows.
Of these, 1843 rows contain null values (NA/NaN/etc).
In total, the series has 15 distinct categories. They are:
The mode of this variable is the value  Prof-specialty.



==================================================================================================== 

The variable relationship has 32561 rows.
Of these, 0 rows contain null values (NA/NaN/etc).
In total, the series has 6 distinct categories. They are:
The mode of this variable is the value  Husband.



==================================================================================================== 

The variable race has 32561 rows.
Of these, 0 rows contain null values (NA/NaN/etc).
In total, the series has 5 distinct categories. They are:
The mode of this variable is the value  White.



==================================================================================================== 

The variable sex has 32561 rows.
Of these, 0 rows contain null values (NA/NaN/etc).
In total, the series has 2 distinct categories. They are:
The mode of this variable is the value  Male.



==================================================================================================== 

The variable capital_gain has 32561 rows.
Of these, 0 rows contain null values (NA/NaN/etc).
Minimum: 0
Maximum: 99999
Mean: 1077.6488437087312
Median: 0.0
Standard deviation: 7385.292084840338


==================================================================================================== 

The variable capital_loss has 32561 rows.
Of these, 0 rows contain null values (NA/NaN/etc).
Minimum: 0
Maximum: 4356
Mean: 87.303829734959
Median: 0.0
Standard deviation: 402.9602186489998


==================================================================================================== 

The variable hours_per_week has 32561 rows.
Of these, 0 rows contain null values (NA/NaN/etc).
Minimum: 1
Maximum: 99
Mean: 40.437455852092995
Median: 40.0
Standard deviation: 12.347428681731843


==================================================================================================== 

The variable native_country has 32561 rows.
Of these, 583 rows contain null values (NA/NaN/etc).
In total, the series has 42 distinct categories. They are:
The mode of this variable is the value  United-States.



==================================================================================================== 

The variable above50k has 32561 rows.
Of these, 0 rows contain null values (NA/NaN/etc).
In total, the series has 2 distinct categories. They are:
The mode of this variable is the value  <=50K.
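
For columns with missing data, the category counts printed above include the missing value as one of the categories: workclass reports 9, while adult.names lists 8 values for it. A minimal sketch on a toy Series (not the real column) shows why: `unique()` keeps NaN among the distinct values, whereas `nunique()` ignores it by default.

```python
import numpy as np
import pandas as pd

# Toy column: two real categories plus one missing value.
v = pd.Series(['Private', 'State-gov', np.nan, 'Private'])

with_nan = len(v.unique())   # unique() keeps NaN as one of the distinct values
without_nan = v.nunique()    # nunique() ignores NaN by default (dropna=True)

print(with_nan, without_nan)
```

If the intent is to count only real categories, `v.nunique()` would be the simpler choice inside the categorical summary.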

Prediction

Let's split the sample into training and test data and use a general-purpose algorithm that performs well on this type of problem, such as Decision Trees or Random Forests. An advantage of these algorithms is that they report how important each feature is to the model's final result.

In [6]:
df.sample(15)
Out[6]:
age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country above50k
18109 58 Self-emp-not-inc 193434 HS-grad 9 Married-civ-spouse Craft-repair Husband White Male 0 0 20 United-States <=50K
4613 77 NaN 232894 9th 5 Married-civ-spouse NaN Husband Black Male 0 0 40 United-States <=50K
26503 74 NaN 89667 Bachelors 13 Widowed NaN Not-in-family Other Female 0 0 35 United-States <=50K
5385 61 Private 173924 9th 5 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 Puerto-Rico >50K
16240 37 Private 183345 Assoc-voc 11 Never-married Craft-repair Not-in-family White Male 0 0 45 United-States <=50K
12047 37 Private 171090 9th 5 Married-civ-spouse Machine-op-inspct Wife Black Female 0 0 48 United-States <=50K
1942 24 Local-gov 249101 HS-grad 9 Divorced Protective-serv Unmarried Black Female 0 0 40 United-States <=50K
16188 58 Private 310320 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 40 United-States >50K
21748 38 Private 229236 HS-grad 9 Married-civ-spouse Transport-moving Husband Other Male 0 0 40 Puerto-Rico <=50K
19386 58 Self-emp-not-inc 130714 Masters 14 Married-civ-spouse Exec-managerial Husband White Male 0 0 50 United-States <=50K
15405 49 Local-gov 199378 HS-grad 9 Married-civ-spouse Protective-serv Husband White Male 0 0 40 United-States <=50K
25410 45 Local-gov 53123 11th 7 Married-civ-spouse Other-service Wife White Female 0 0 25 United-States <=50K
22313 26 Self-emp-not-inc 258306 10th 6 Married-civ-spouse Farming-fishing Husband White Male 0 0 99 United-States <=50K
30630 53 Private 133219 HS-grad 9 Married-civ-spouse Other-service Husband Black Male 4386 0 30 United-States >50K
18717 26 Private 162302 Some-college 10 Never-married Machine-op-inspct Not-in-family Asian-Pac-Islander Male 0 0 40 Philippines <=50K
In [7]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
In [8]:
X_train, X_test, y_train, y_test = train_test_split(pd.get_dummies(df.drop('above50k', axis=1)), df['above50k'], test_size=0.25)
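
The split above is drawn at random on every run, so the scores that follow will vary slightly between executions. A minimal sketch, on synthetic data rather than the real DataFrame, of how the `random_state` and `stratify` parameters of `train_test_split` could make the split repeatable and keep the class ratio the same in both parts. The seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 100 rows, 20% positive labels.
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    test_size=0.25,
    random_state=42,   # arbitrary seed, fixed so reruns give the same split
    stratify=y,        # keep the class proportions equal in both parts
)
print(y_tr.mean(), y_te.mean())
```

Stratifying matters most when one class is rare, as with the '>50K' label here.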
In [9]:
X_train.sample(15)
Out[9]:
age fnlwgt education_num capital_gain capital_loss hours_per_week workclass_ Federal-gov workclass_ Local-gov workclass_ Never-worked workclass_ Private ... native_country_ Portugal native_country_ Puerto-Rico native_country_ Scotland native_country_ South native_country_ Taiwan native_country_ Thailand native_country_ Trinadad&Tobago native_country_ United-States native_country_ Vietnam native_country_ Yugoslavia
1222 53 22154 10 0 0 40 False False False True ... False False False False False False False True False False
14377 22 106700 12 0 0 27 False False False True ... False False False False False False False True False False
8562 49 122066 9 0 0 30 False False False True ... False False False False False False False False False False
1270 47 200734 13 0 0 45 False False False True ... False False False False False False False True False False
14360 24 26671 9 0 0 40 False False False False ... False False False False False False False True False False
26254 43 45156 10 0 0 60 False False False True ... False False False False False False False True False False
12621 62 197286 8 0 0 48 False False False True ... False False False False False False False False False False
25379 45 127089 14 0 0 45 False False False False ... False False False False False False False True False False
21917 54 227832 8 0 0 40 False False False True ... False False False False False False False True False False
728 31 223212 9 0 0 40 False False False True ... False False False False False False False False False False
32535 22 325033 8 0 0 35 False False False True ... False False False False False False False True False False
11565 43 150533 12 0 0 50 False False False False ... False False False False False False False True False False
14798 20 353195 9 0 0 35 False False False True ... False False False False False False False True False False
21934 65 94552 10 0 0 40 False False False True ... False False False False False False False True False False
22447 43 195897 9 7298 0 40 True False False False ... False False False False False False False True False False

15 rows × 105 columns
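
The 105 columns come from `pd.get_dummies`, which leaves numeric columns untouched and expands each categorical column into one indicator per category. By default a missing value gets no indicator at all, so those rows are encoded as all False for that column. A minimal sketch on a toy frame (not the real data) showing this default and the `dummy_na=True` alternative, which adds an explicit indicator for missing values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'age': [39, 50, 38],
    'workclass': ['Private', np.nan, 'State-gov'],
})

plain = pd.get_dummies(toy)                    # the NaN row gets no indicator set
flagged = pd.get_dummies(toy, dummy_na=True)   # adds an explicit workclass_nan column

print(plain.columns.tolist())
print(flagged.columns.tolist())
```

Whether missing values deserve their own indicator is a modeling choice; the notebook uses the default behavior.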

In [10]:
y_train.sample(15)
Out[10]:
26638     <=50K
14471      >50K
14240      >50K
23534     <=50K
22836     <=50K
9876      <=50K
21230     <=50K
26957      >50K
29012     <=50K
19016     <=50K
12995     <=50K
10781     <=50K
10902     <=50K
24764     <=50K
6730      <=50K
Name: above50k, dtype: object
In [11]:
model = RandomForestClassifier()
In [12]:
model.fit(X_train, y_train)
Out[12]:
RandomForestClassifier()
In [13]:
y_hat = model.predict(X_test)
In [14]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
In [15]:
accuracy_score(y_test, y_hat)
Out[15]:
0.8556688367522417
In [16]:
confusion_matrix(y_test, y_hat)
Out[16]:
array([[5721,  469],
       [ 706, 1245]])
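
The accuracy, precision, recall, and F1 figures can all be derived from this confusion matrix. The sketch below redoes that arithmetic by hand for the '>50K' class, assuming scikit-learn's convention that rows are true labels and columns are predictions, with labels in sorted order ('<=50K' first).

```python
import numpy as np

# Confusion matrix reported above: rows are true labels, columns are predictions,
# in the order ['<=50K', '>50K'].
cm = np.array([[5721, 469],
               [706, 1245]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the rows predicted '>50K', how many really are
recall = tp / (tp + fn)      # of the rows that really are '>50K', how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(f1, 2))
```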
In [17]:
print(classification_report(y_test, y_hat))
              precision    recall  f1-score   support

       <=50K       0.89      0.92      0.91      6190
        >50K       0.73      0.64      0.68      1951

    accuracy                           0.86      8141
   macro avg       0.81      0.78      0.79      8141
weighted avg       0.85      0.86      0.85      8141

In [18]:
# Pair each one-hot encoded column with its importance and keep the 15 largest.
X_names = X_train.columns
fi = dict(zip(X_names, model.feature_importances_))
fi_sorted = sorted(fi.items(), key=lambda x: x[1], reverse=True)
top_fi = dict(fi_sorted[:15])
fig = px.bar(x=list(top_fi.keys()), y=list(top_fi.values()), title="Feature importance")
fig.show()
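
Because the categorical columns were one-hot encoded, the importance of a variable such as marital_status is spread over several indicator columns. One possible way to read the result per original variable is to sum the indicators back together. The sketch below uses a hypothetical helper, `aggregate_importances`, and made-up importance values for illustration only; it is not part of the original analysis.

```python
from collections import defaultdict

def aggregate_importances(importances, original_columns):
    """Hypothetical helper: sum one-hot importances back onto their source column."""
    totals = defaultdict(float)
    for name, value in importances.items():
        if name in original_columns:
            source = name   # numeric column, not expanded by get_dummies
        else:
            # get_dummies names indicators '<column>_<category>'; take the longest
            # matching column so 'education_num' is never confused with 'education'.
            matches = [c for c in original_columns if name.startswith(c + '_')]
            source = max(matches, key=len) if matches else name
        totals[source] += value
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))

# Made-up importances, for illustration only (not results from the model above).
toy = {
    'age': 0.30,
    'education_num': 0.20,
    'education_ Bachelors': 0.10,
    'education_ Masters': 0.05,
    'marital_status_ Married-civ-spouse': 0.25,
    'marital_status_ Never-married': 0.10,
}
aggregated = aggregate_importances(toy, ['age', 'education', 'education_num', 'marital_status'])
print(aggregated)
```

With the real model, the first argument would be the `fi` dictionary and the second the feature names without the target column.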

The Professor's Question

Does the algorithm perform equally well for men and for women?

In [19]:
homens = df.query('sex == " Male"')
mulheres = df.query('sex == " Female"')
In [20]:
len(homens),len(mulheres)
Out[20]:
(21790, 10771)
In [21]:
# Train and evaluate a separate model using only the rows for men.
X_train, X_test, y_train, y_test = train_test_split(pd.get_dummies(homens.drop('above50k', axis=1)), homens['above50k'], test_size=0.25)
model = RandomForestClassifier()
model.fit(X_train, y_train)
y_hat = model.predict(X_test)
print('==========================')
print('ANALYSIS FOR THE MEN GROUP')
print('==========================')
print('Accuracy:', accuracy_score(y_test, y_hat))
print('Confusion matrix:\n', confusion_matrix(y_test, y_hat))
print('Classification report:\n', classification_report(y_test, y_hat))
==========================
ANALYSIS FOR THE MEN GROUP
==========================
Accuracy: 0.8166299559471366
Confusion matrix:
 [[3395  366]
 [ 633 1054]]
Classification report:
               precision    recall  f1-score   support

       <=50K       0.84      0.90      0.87      3761
        >50K       0.74      0.62      0.68      1687

    accuracy                           0.82      5448
   macro avg       0.79      0.76      0.78      5448
weighted avg       0.81      0.82      0.81      5448

In [22]:
# Train and evaluate a separate model using only the rows for women.
X_train, X_test, y_train, y_test = train_test_split(pd.get_dummies(mulheres.drop('above50k', axis=1)), mulheres['above50k'], test_size=0.25)
model = RandomForestClassifier()
model.fit(X_train, y_train)
y_hat = model.predict(X_test)
print('============================')
print('ANALYSIS FOR THE WOMEN GROUP')
print('============================')
print('Accuracy:', accuracy_score(y_test, y_hat))
print('Confusion matrix:\n', confusion_matrix(y_test, y_hat))
print('Classification report:\n', classification_report(y_test, y_hat))
============================
ANALYSIS FOR THE WOMEN GROUP
============================
Accuracy: 0.9301893798737467
Confusion matrix:
 [[2339   42]
 [ 146  166]]
Classification report:
               precision    recall  f1-score   support

       <=50K       0.94      0.98      0.96      2381
        >50K       0.80      0.53      0.64       312

    accuracy                           0.93      2693
   macro avg       0.87      0.76      0.80      2693
weighted avg       0.92      0.93      0.92      2693
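
The two accuracy figures are not directly comparable, because the classes are balanced very differently in each group. As a reading aid (not part of the original analysis), the sketch below takes the two confusion matrices reported above and compares each accuracy with the score of a trivial classifier that always predicts the majority class. It also recomputes recall for the '>50K' class.

```python
import numpy as np

# Confusion matrices reported above (rows: true '<=50K', '>50K'; columns: predicted).
groups = {
    'men': np.array([[3395, 366], [633, 1054]]),
    'women': np.array([[2339, 42], [146, 166]]),
}

summary = {}
for name, cm in groups.items():
    total = cm.sum()
    accuracy = np.trace(cm) / total
    baseline = cm.sum(axis=1).max() / total   # accuracy of always predicting the majority class
    recall_high = cm[1, 1] / cm[1].sum()      # share of true '>50K' rows that were found
    summary[name] = (accuracy, baseline, recall_high)
    print(f'{name}: accuracy={accuracy:.3f} baseline={baseline:.3f} recall(>50K)={recall_high:.3f}')
```

Only 312 of the 2693 test rows for women are '>50K', so always predicting '<=50K' would already score about 88%. Measured against that baseline the gain is smaller for women than for men, and recall for '>50K' is also lower for women (0.53 against 0.62), despite the higher raw accuracy.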